-
Notifications
You must be signed in to change notification settings - Fork 103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CL/HIER: bcast 2step algorithm #620
CL/HIER: bcast 2step algorithm #620
Conversation
4ae0f20
to
4c8ae5e
Compare
4c8ae5e
to
e08014e
Compare
@vspetrov looks like with new gtest asan found memleak, do you want to fix it in this PR. Or I can create fix in another PR |
ac489b1
to
9ffa8b2
Compare
goto out; | ||
} | ||
n_tasks++; | ||
if (root_on_local_node && (root != rank)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like this condition is always false
if root != rank => root on remote node because we are inside if (SBGP_ENABLED(cl_team, NODE_LEADERS))
if root_on_local_node => root == rank because of the same reason
In that case you don't need first_task because first_task is always 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, i beleive you are right. I re-checked and tested.
9ffa8b2
to
c322720
Compare
ucc_schedule_add_task(schedule, tasks[0]); | ||
if (n_tasks > 1) { | ||
if (root == rank) { | ||
ucc_task_subscribe_dep(&schedule->super, tasks[1], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not correct, you can start 2 tasks (node leaders bcast and node bcast) in parallel only if root rank belongs to node leaders group.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not correct, you can start 2 tasks (node leaders bcast and node bcast) in parallel only if root rank belongs to node leaders group.
If root is not node leader then it has only 1 task, and will not get there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, but node leader task should depend on node task in that case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right.. this is fun. that is exactly why i had "first_task" before. and the condition above was correct actually to handle that case. i'll revert that last one change
c322720
to
2155d14
Compare
so, i added "first_task" back since it handles exactly the required case. Also, the reason gtest didn't catch the bug when i removed "first_task" is because the was incorrect "reset" callback in gtest bcast unit test. fixed that as well. |
2155d14
to
dbecc38
Compare
looks like persistent gtest hangs with new bcast 2 step. need to debug... |
Need to check with UCF folks on the copyright |
@vspetrov Please add the copyright blurb you want. It was cleared by UCF. |
dbecc38
to
15915dd
Compare
* CL/HIER: bcast 2step algorithm * TEST: gtest cl_hier_rab allreduce gtest * TEST: bcast cl_hier_2step gtest * CI: add enable-assert to default clang-tidy * CL/HIER: fix oob mem leak * REVIEW: address comments * TEST: bcast gtest "reset" fix * CL/HIER: bcast 2 step persistent fallback --------- Co-authored-by: Valentin Petrov <valentinp@nvidia.com>
What
Adds 2 lvl pipelined broadcast algorithm to CL/HIER
Perf improvement. Example on 32nodes, ppn=56.